1. Data Collection and Preprocessing

Python
Machine Learning
Data Visualization
Published September 21, 2025

Getting Started with the Pokemon TCG Dataset

I am using the dataset from the pokemontcg.io API service. The project has a GitHub repository with all the data in JSON format that can be downloaded directly. After downloading and extracting the data, here is my project file structure:

.
├── data
│   └── json
│       ├── cards
│       │   └── en
│       │       ├── base1.json
│       │       ├── ...
│       │       └── zsv10pt5.json
│       └── sets
│           └── en.json
└── part1.ipynb

We can see that the card data is separated into one file per set. So let's first see how many cards we are working with by looping through all the JSON files in the data/json/cards/en/ directory and counting the total number of cards across them.

import os
import json

cards_dir = 'data/json/cards/en/'
all_cards_dataset = []

# Each JSON file holds a list of card objects for one set
for file_name in sorted(os.listdir(cards_dir)):
    if file_name.endswith('.json'):
        # The data contains non-ASCII text (e.g. "Pokémon"), so read it as UTF-8
        with open(os.path.join(cards_dir, file_name), 'r', encoding='utf-8') as f:
            all_cards_dataset.extend(json.load(f))

print(f"Total number of cards in dataset: {len(all_cards_dataset)}")
Total number of cards in dataset: 19653
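The loop above only reports the overall total. If a per-set breakdown is useful, it could be sketched as follows. The helper name count_cards_per_set is my own, and the sketch assumes each file is a JSON list of cards named after its set, as in the file structure shown earlier.

```python
import json
import os
from collections import Counter


def count_cards_per_set(cards_dir):
    """Map each set id (file name without .json) to its number of cards."""
    counts = Counter()
    for file_name in sorted(os.listdir(cards_dir)):
        if not file_name.endswith(".json"):
            continue  # skip anything that is not a set file
        path = os.path.join(cards_dir, file_name)
        with open(path, "r", encoding="utf-8") as f:
            set_id = os.path.splitext(file_name)[0]
            counts[set_id] = len(json.load(f))
    return counts


cards_dir = "data/json/cards/en/"
if os.path.isdir(cards_dir):  # only runs when the data has been downloaded
    counts = count_cards_per_set(cards_dir)
    for set_id, n_cards in counts.most_common(5):
        print(f"{set_id}: {n_cards} cards")
    print(f"Total: {sum(counts.values())}")
```

The sum of the per-set counts should equal the total printed above, which makes for a quick consistency check.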

For this project, 19653 cards is more than I need. I am going to keep only the cards for first-generation Pokémon (the original 151) for my analysis. This should give me enough data to work with and lets the data span multiple expansions and sets while keeping the dataset small. In a future analysis I might expand to all Pokémon cards.

Let's first convert all the JSON files to a single CSV file containing all the cards, and then filter the data to keep only the first-generation Pokémon.

import pandas as pd

df = pd.json_normalize(all_cards_dataset)
df.to_csv('data/all_pokemon_cards.csv', index=False)
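As a side note on what pd.json_normalize does here: it flattens nested objects into dotted column names, whereas pd.DataFrame keeps a nested object as a dict inside a single column. The toy records below are made up for illustration (the ids, file names, and the choice of an images field are assumptions, not taken from the dataset).

```python
import pandas as pd

# Toy records standing in for card objects (hypothetical values)
sample_cards = [
    {"id": "demo-1", "name": "Pikachu", "images": {"small": "s1.png", "large": "l1.png"}},
    {"id": "demo-2", "name": "Mew", "images": {"small": "s2.png", "large": "l2.png"}},
]

flat = pd.json_normalize(sample_cards)  # nested keys become "images.small", "images.large"
raw = pd.DataFrame(sample_cards)        # nested dict stays in one "images" column

print(list(flat.columns))
print(list(raw.columns))
```

This is why the same records can produce a different number of columns depending on which constructor is used.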

To find all the cards related to first-generation Pokémon, I need a list of their names. I found a list on Wikipedia and copied the names into a text file, which I use to filter the cards. Here is the list:

Bulbasaur Ivysaur Venusaur Charmander Charmeleon Charizard Squirtle Wartortle Blastoise Caterpie Metapod Butterfree Weedle Kakuna Beedrill Pidgey Pidgeotto Pidgeot Rattata Raticate Spearow Fearow Ekans Arbok Pikachu Raichu Sandshrew Sandslash Nidoran-Female Nidorina Nidoqueen Nidoran-Male Nidorino Nidoking Clefairy Clefable Vulpix Ninetales Jigglypuff Wigglytuff Zubat Golbat Oddish Gloom Vileplume Paras Parasect Venonat Venomoth Diglett Dugtrio Meowth Persian Psyduck Golduck Mankey Primeape Growlithe Arcanine Poliwag Poliwhirl Poliwrath Abra Kadabra Alakazam Machop Machoke Machamp Bellsprout Weepinbell Victreebel Tentacool Tentacruel Geodude Graveler Golem Ponyta Rapidash Slowpoke Slowbro Magnemite Magneton Farfetch'd Doduo Dodrio Seel Dewgong Grimer Muk Shellder Cloyster Gastly Haunter Gengar Onix Drowzee Hypno Krabby Kingler Voltorb Electrode Exeggcute Exeggutor Cubone Marowak Hitmonlee Hitmonchan Lickitung Koffing Weezing Rhyhorn Rhydon Chansey Tangela Kangaskhan Horsea Seadra Goldeen Seaking Staryu Starmie Mr-Mime Scyther Jynx Electabuzz Magmar Pinsir Tauros Magikarp Gyarados Lapras Ditto Eevee Vaporeon Jolteon Flareon Porygon Omanyte Omastar Kabuto Kabutops Aerodactyl Snorlax Articuno Zapdos Moltres Dratini Dragonair Dragonite Mewtwo Mew
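Because the list was copied by hand, a quick sanity check when loading it can catch a missing or duplicated name. This is only a sketch: load_name_list and its expected_count parameter are hypothetical names of mine, and it assumes the names are separated by whitespace (which is why multi-word names are hyphenated in the list).

```python
def load_name_list(path, expected_count=151):
    """Read whitespace-separated names and run basic sanity checks."""
    with open(path, "r", encoding="utf-8") as f:
        names = f.read().split()

    # A copy-paste slip usually shows up as a duplicate or a wrong count
    duplicates = sorted({name for name in names if names.count(name) > 1})
    if duplicates:
        raise ValueError(f"Duplicate names: {duplicates}")
    if len(names) != expected_count:
        raise ValueError(f"Expected {expected_count} names, found {len(names)}")
    return names
```

It could be called as load_name_list('data/first_gen_pokemons.txt') before building the filter.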

We can then use this list to filter the card data down to the first-generation Pokémon. Since a card can have a name like “Cool Porygon”, we will use the str.contains method to keep the names that contain any of the first-generation Pokémon names. This is not perfect (for example, the pattern Porygon also matches Porygon2, which is not a first-generation Pokémon), but it should give us a good enough dataset to work with.

# This text file contains the names of all first-generation Pokémon
with open('data/first_gen_pokemons.txt', 'r', encoding='utf-8') as f:
    first_gen_pokemons = f.read().split()

# Join the names into one regex alternation: "Bulbasaur|Ivysaur|..."
first_gen_pattern = '|'.join(first_gen_pokemons)

df = pd.DataFrame(all_cards_dataset)
filtered_df = df[df['name'].str.contains(first_gen_pattern)]
filtered_df = filtered_df[filtered_df['supertype'] == 'Pokémon']  # Keep only Pokémon cards
print(f"Shape of our dataframe: {filtered_df.shape}")
Shape of our dataframe: (4470, 25)
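If the plain substring match turns out to be too loose, one possible tightening is to escape each name and require that it is not glued to other letters or digits. The sketch below is not what produced the 4470 figure above. build_name_pattern is a hypothetical helper of mine, and a stricter pattern would likely change the row count.

```python
import re


def build_name_pattern(names):
    """Compile a regex that matches any of the names as a whole word."""
    # Try longer names first so "Mewtwo" is preferred over "Mew"
    ordered = sorted(names, key=len, reverse=True)
    escaped = [re.escape(name) for name in ordered]
    # (?<!\w) and (?!\w) reject matches embedded in a longer word,
    # and the non-capturing group keeps the alternation in one unit
    return re.compile(r"(?<!\w)(?:" + "|".join(escaped) + r")(?!\w)")


pattern = build_name_pattern(["Mew", "Mewtwo", "Porygon", "Farfetch'd"])
print(bool(pattern.search("Cool Porygon")))  # name preceded by another word
print(bool(pattern.search("Porygon2")))      # name followed by a digit
```

The compiled pattern's string (pattern.pattern) can be passed to df['name'].str.contains in place of the simple joined string. Names followed by a hyphen still match under this rule, and hyphenated list entries such as Mr-Mime only match if the card names are spelled the same way, so both cases are worth checking against the data.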

We have taken our initial dataset of 19653 cards and filtered it down to 4470 cards. Now that the dataset is filtered, we can move on to cleaning and processing the data. Let's save the filtered dataset to a CSV file for part 2.

filtered_df.to_csv('data/first_gen_pokemon_cards.csv', index=False)
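One thing to keep in mind when reading this file back: CSV stores every cell as text, so any column that held Python lists in the dataframe comes back as a string. A minimal sketch of restoring such a column follows. It uses a toy dataframe, and the types column and the parse_list_cell helper are assumptions made for illustration.

```python
import ast
import io

import pandas as pd


def parse_list_cell(value):
    """Turn a stringified list like "['Fire']" back into a list."""
    if isinstance(value, str):
        return ast.literal_eval(value)
    return value  # leave missing values (NaN) untouched


# Toy dataframe with a list-valued column (hypothetical values)
toy_df = pd.DataFrame([
    {"name": "Pikachu", "types": ["Lightning"]},
    {"name": "Charizard", "types": ["Fire"]},
])

buffer = io.StringIO()  # in-memory stand-in for a CSV file on disk
toy_df.to_csv(buffer, index=False)
buffer.seek(0)

restored = pd.read_csv(buffer)
print(type(restored.loc[0, "types"]))  # a str after the round trip
restored["types"] = restored["types"].apply(parse_list_cell)
print(type(restored.loc[0, "types"]))  # a list again
```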